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Abstract 

Background: A key challenge in the post genome era is to identify genome-wide transcriptional regulatory 
networks, which specify the interactions between transcription factors and their target genes. Numerous methods 
have been developed for reconstructing gene regulatory networks from expression data. However, most of them 
are based on coarse grained qualitative models, and cannot provide a quantitative view of regulatory systems. 

Results: A binding affinity based regulatory model is proposed to quantify the transcriptional regulatory network. 
Multiple quantities, including binding affinity and the activity level of transcription factor (TF) are incorporated into 
a general learning model. The sequence features of the promoter and the possible occupancy of nucleosomes are 
exploited to estimate the binding probability of regulators. Comparing with the previous models that only employ 
microarray data, the proposed model can bridge the gap between the relative background frequency of the 
observed nucleotide and the gene's transcription rate. 

Conclusions: We testify the proposed approach on two real-world microarray datasets. Experimental results show 
that the proposed model can effectively identify the parameters and the activity level of TF. Moreover, the kinetic 
parameters introduced in the proposed model can reveal more biological sense than previous models can do. 



Background 

A challenge facing molecular biology is to develop quan- 
titative, predictive models of gene regulation. The 
advance of high-throughput microarray technique 
makes it possible to measure the expression profiles of 
thousands of genes, and genome-wide microarray data- 
sets are collected, providing a way to reveal the complex 
regulatory mechanism among cells. There are two broad 
classes of gene regulatory interactions: one based on the 
physical interaction' that aim at identifying relationships 
among transcription factors and their target genes 
(gene-to-sequence interaction) and another based on the 
'influence interaction' that try to relate the expression of 
a gene to the expression of the other genes in the cell 
(gene-to-gene interaction). 
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In recent years, researchers have proposed many dif- 
ferent computational approaches to reconstruct gene 
regulatory networks from high-throughput data, e.g. see 
reviews by Bansal et al. and Markowetz and Spang [1,2]. 
These approaches fall roughly into two categories: quali- 
tative and quantitative aspects. Inferring qualitative reg- 
ulatory networks from microarray data has been well 
studied, and a number of effective approaches have been 
developed [3-10]. However, these methods are based on 
coarse grained qualitative models [11,12], and cannot 
provide a realistic and quantitative view of regulatory 
systems. On the other hand, quantitative modelling for 
gene regulatory network is in its infancy. Research on 
quantitative models for genetic regulation has arisen 
only in recent years, and most of them are based on 
classical statistical techniques. Liebermeister et al. [13] 
proposed a linear model for cell cycle-related gene 
expression in yeast based on independent component 
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analysis. Holter et al [14] employ singular value decom- 
position to uncover the fundamental patterns underlying 
gene expression profiles. Pournara et al. [15] and Yu et 
al. [16] proposed the Factor analysis model to describe a 
larger number of observed variables. However, these 
approaches are based on linear regression, and are not 
always being consistent with the observations in bio- 
chemical experiments which are nonlinear. Imoto et al. 
[17] proposed a nonlinear model with heterogeneous 
error variances. This model matches the microarray data 
well but it is not satisfying enough in revealing more 
biological sense. Segal et al. [18] proposed a transcrip- 
tion control network based model and apply their 
model to the segmentation gene network of Drosophila 
melanogaster. They reveal that positional information is 
encoded in the regulatory sequence and input factor dis- 
tribution. However, there is still a little bit of dilemma 
in the model: the activity level of transcription factors is 
difficult to be measured or to be identified. Actually, 
assaying the transcription factors' activity state in a 
dynamic fashion is a major obstacle to the wider appli- 
cation of the kinetic modeling. TFs' activity levels are 
difficult to measure mainly due to two technical limita- 
tions: TFs are often present at low intercellular concen- 
trations and the changes in their activity state can occur 
rapidly due to post-translational modifications. 

Based on the above description, this paper aims to 
describe the transcriptional regulatory network quantita- 
tively. In this work, a Bayesian inference based regula- 
tory model is proposed to quantify the transcriptional 
dynamics. Multiple quantities, including binding energy, 
binding affinity and the activity level of transcription 
factor are incorporated into a general learning model. 
The sequence features of the promoter and the occu- 
pancy of nucleosomes are exploited to derive the bind- 
ing energy. Compared with the previous models, the 
proposed model can reveal more biological sense. 

Results 

Case I: Circadian patterns in rat liver 

Circadian rhythm is a daily time-keeping mechanism 
fundamental to a wide range of species. The basic mole- 
cular mechanism of circadian rhythm has been studied 
extensively. As a real example to test our approach, we 
considered the dynamics of the circadian patterns in rat 
liver. We employ the datasets from Almon et al [19]. 
This experiment was designed to examine fluctuations 
in gene expression in liver within the 24 hour circadian 
cycle in normal animals. Fifty-four male normal Wistar 
rats were housed in a strictly controlled stress free 
environment with light: dark cycles of 12 hr: 12 hr. 
Three animals were sacrificed at each of 18 selected 
time points within the 24 hour cycle. RNA was prepared 
from livers for gene arrays. Time point designations 



reflect time after lights on in hours. For details, please 
refer to Table SI in additional file 1. 

Analysis of the predicted activity levels of transcription 
factors 

To test the proposed model on the above dataset, we 
employ two important transcriptional regulators of 
which activity levels indicate the variation of heat signals 
in a subset of gene circadian network, hsfl and ppara. In 
total, we selected 7 genes to perform posterior inference 
of TF activities. To ensure identifiability, we included 
three genes that are regulated solely by hsfl (HSP110, 
HSPA8 and COL4A1), and two genes that are regulated 
solely by ppara (ACSL1 and HMGCS1). The remaining 
two genes are jointly regulated by hsfl and ppara. These 
genes were chosen since they exhibit the largest variance 
in the microarray time course, and hence are likely to 
provide a cleaner representation of the output of the 
system. 

The inferred TFs' activity levels are shown in Figure 1 
(a) and 1(b). Both inferred TF profiles show a noisy per- 
iodic behaviour [20]. Figure 1(c) gives the values of the 
parameters ki for the four selected circadian genes 
(HSPA8, ACSL1, HSP90AA1 and HSPA1B). The green 
column represents the response ki to hsfl alone, the red 
column is the response k 2 to ppara alone and the black 
column represents the joint response k 12 . It can be seen 
that, for gene, HSPA8, the model predicts a clear activa- 
tion by hsfl alone, which is consistent with the experi- 
mental conclusion from Yan et al [20]. The black 
columns of HSP90AA1 and HSPA1B demonstrate that 
the model predicts a significant combinatorial activation 
which can be verified by mutagenetic techniques, i.e. by 
knocking out one of the TFs. 

The biological sense of kinetic parameters 

Table 1 shows the relationship between scaling para- 
meter k and the corresponding binding affinity 0. In 
table 1, 'H' indicates 'high' and 'V indicates 'low'. We 
define the scaling parameter ki as 'High' if it is bigger 
than the mean value, as 'low', otherwise, and the same 
to binding affinity 0. From Table 1, we can find that, for 
most case, the scaling parameter is in accordance with 
the binding affinity: High scaling parameter coupling 
with high binding affinity, vice versa. However, gene 
COL4A1 and HSP110 are 2 exceptions: they have high 
scaling parameter but low binding affinity. Our view is 
that low binding affinity but high value for k t might 
represent a TF which rarely binds to promoter but can 
strongly regulate gene expression when bound. 

Figure 2 shows the results of inference on the values 
of the parameters c } and 00j. The columns on the left, 
shaded red, show results from our model and the white 
columns are the estimates obtained by the method of 
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Barenco et al. [21]. The parameters were assigned a 
vague gamma prior distribution (a = b = 0.2, corre- 
sponding to a mean of 1 and a variance of 5). The 
results are in good accordance with the results obtained 
by Barenco et. a 1 [21]. We can find that the parameters 
Cj and (Dj obtained by our method have smaller variance 
than that of Barenco et al. Figure 3 shows the fit of the 
model to the observed data at each time-point. 

Case II: A yeast synthetic network for in vivo assessment 

Validation of gene regulation network (GRN) inference 
methods has traditionally been done using in silico net- 
works. However, depending on how realistic the choice 
of an in silico model is, this kind of validation approach 
has obvious limitations. To our knowledge, rarely the 
underlying model from which artificial/simulated data is 
generated is realistic enough. Real biological networks 
are fairly complex chemical reaction network models. In 
simulation setting one typically adds noise on top of a 
hypothetical simulation model, but the noise character- 
istics may not be realistic enough. Also, simulation 
models tend to be overly simplistic when compared to 
e.g. real gene regulatory networks. Data measured from 
a real biological system is, real. To overcome these pro- 
blems, we use the IRMA network to evaluate out model. 
The IRMA network is a synthetically constructed GRN 
in the Saccharomyces cerevisiae genome [22], which is 
designed to be maximally independent in such a way 
that genes in the network are not regulated by genes 
outside of the network (i.e. by other yeast genes). How- 
ever, genes in the IRMA network may regulate other 
genes in the genome. The network consists of five genes 



and there are positive and negative feedback loops and 
one protein to protein interaction, shown as Figure 4. 
Although the IRMA network contains only five genes, 
there are studies suggesting that the performance on 
smaller networks typically generalize to larger networks 
[1,23]. The data samples were collected every 20 min up 
to 5 hr in five independent experiments for the switch- 
on state, and every 10 min up to 3 hr in four indepen- 
dent experiments for the switch-off state. For details on 
the construction of the network and experimental pro- 
cedures, we refer to [22], One of the purposes of the 
IRMA network is to provide a realistic benchmark set 
for computational studies by providing mRNA-level 
measurements from a known GRN. To our knowledge, 
the IRMA network and dataset are the first of a kind 
that are meant for validation purposes. Besides, we 
assume that mRNA decay rates may be gene-specific, 
but remain constant in time [36]. The sequences of all 
promoters are retrieved from SCPD and SGD database. 
The scanning region ranges from -800 to +50 bp of 
each target gene. 

Analysis of the predicted activity levels of transcription 
factors 

To evaluate whether the proposed model can effec- 
tively learn the TFs' activity level and the regulation 
type, we first evaluate the model using the switch-on 
time series data. The inferred TFs' activity levels are 
shown in Figure 5(a) and 5(b). Both inferred TF pro- 
files show a noisy switch-on behavior. Figure 5(c) gives 
the values of the parameters ki for the five target 
genes. The green column represents the response to 
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Figure 2 The bar charts for basal transcription rates and decay rates, (a) Basal transcription rates from our model and that of Barenco et al. 

Red are estimates obtained with our model, white are the estimates obtained by Barenco et al [21]. (b) Similar for decay rates, 
v ) 



the first regulator alone, the red column is the 
response to the second regulator alone and the black 
column represents the joint response, k 12 . It can be 
seen that, for gene, GAL80, the model predicts a clear 
activation by swi5 which is consistent with the experi- 
mental conclusion [22]. For gene CBF1, the red down- 
ward column indicates that ashl behaves as a 
repressor, which is verified in [22]. The black column 
of CBF1 demonstrates that the model predicts a signif- 
icant combinatorial regulation [22]. 



Analysis of the kinetic parameters 

Table 2 shows the relationship between scaling para- 
meter k and the corresponding binding affinity 0. In 
table 2, the definition of 'High' and 'Low' are same as in 
Table 1, and the same abbreviations are employed. It 
can be found that gene GAL80 has the TF that rarely 
binds to promoter but can strongly up-regulate its 
expression when bound. 

Figure 6 shows the results of inference on the values of 
the parameters Cj and coj. The columns on the left, shaded 
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Figure 3 The predicted mean expression profiles, (a) HSPA8, (b) COL4A1, (c) ACSL1, (d) HMGCS1, (e) HSP90AA1 and (f) HSPA1B. The red circle 
indicates the observed value at each time-points. 
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Figure 4 IRMA network. The rectangle indicates the gene while the oval represents the protein. 



red, show results from our model and the white columns 
are the estimates obtained by Opper et al [24]. It can be 
found that the results are in good accordance with the 
results obtained by the method of Opper et AL [24]. It can 
be found that the parameters Cj and coj obtained by our 
method have smaller variance than that of Opper et al. [24], 
For comparison, we also evaluate the model using the 
switch-off data. Figure 7 shows the fit of the model to 



the observed data at each time-point for both the 
switch-on case and switch-off case. 

Discussion 

In this study, two real-world microarray datasets were 
exploited two evaluate the efficiency of the proposed 
model. Comparison shows that the kinetic parameters 
obtained by our method have smaller variance than that 
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Figure 5 Results on IRMA network data, (a) mean activity profile for regulator swi5, (b) mean activity profile for regulator ash1, (c) bar-chart 
representation of the parameters ki. 
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Table 2 Relationship between k and <p for IRMA network 
data. 



Gene 


GAL80 


GAL4 


SWI5 


ASH1 


CBF1-swi5 


CBFI-ashl 


k 


H 


L 


H 


L 


H 


H 


0 


L 


L 


H 


L 


H 


H 



of Barenco et al. [21]. One reason is that the proposed 
model provides a principled method for the incorpora- 
tion of prior biological knowledge. This may be in the 
form of suitable ranges for kinetic parameters, known 
kinetic parameter values and suitable distributions on 
measurement noise. Besides, it is possible for the pro- 
posed model to circumvent the need for expensive sam- 
pling-based inference and a TFA profile can be inferred 
over all time, rather than just at the discrete time-points 
at which expression was measured. 

The Bayesian inference based model of transcription 
rates and regulator activity levels allows us to handle 
these biologically relevant quantities despite the indirect 
measurement of the former and the lack of measure- 
ments of the latter. It also allows us to handle the inher- 
ently noisy measurement in a principled way. However, 
the proposed model still abstracts away some of the 
explicit processes that generate the actual observed 
expression data. A more explicit modelling of these will 
provide a more principled treatment of different sources 
of noise in the data. Furthermore, this model does not 
handle directly the upstream events that affect regulator 
activity. In fact, the current model is an open loop sys- 
tem, such that the regulation of regulator activity is 
itself viewed as exogenous to the system. By developing 
a richer modeling language we may capture more com- 
plex reaction models, model the upstream regulation of 
activity levels, and identify systems that involve feedback 
mechanisms and signalling networks. 

Post-Transcriptional Modification Model (PTMM) 
have been previously used to model TF activities [25]; in 
that work, further dependencies were included between 



TF mRNA expression levels and their predicted activ- 
ities, which enabled to predict possible post-transcrip- 
tional modifications in TFs. Naturally, it should be 
possible to combine both our approach and their 
approach to give a model capable of simultaneously 
inferring TF activities, combinatorial interactions and 
post transcriptional regulations. 

Conclusions 

In this work, we have proposed a computational model 
to reverse engineer simultaneously both the activity of 
TFs and the logical structure of promoters by integrat- 
ing multiple sources of knowledge, including time-series 
gene expression data, TFs' binding information and 
sequence features of promoters. The approach relies on 
a detailed model of transcription, which is an approxi- 
mation to the Michaelis-Menten model from classical 
enzyme kinetics, and therefore should be able to capture 
accurately the effects that changes in TF activity have 
on gene expression dynamics. We testify the proposed 
approach on two real-world microarray datasets. Experi- 
mental results show that the proposed model can effec- 
tively identify the parameters and the activity level of 
TF. Moreover, the kinetic parameters introduced in the 
proposed model can reveal more biological sense than 
previous models can do. 

Methods 

Problem statement 

A microarray experiment only measures the "observed" 
quantities, as shown in Figure 8, whereas the other 
quantities are not observed ("hidden"). The dashed oval 
encloses the closest quantities on the path between the 
TF and the target gene. Consider a transcriptional net- 
work of n genes that are regulated by m regulators, as 
well as a time-dependent external signal. Given the 
structure G and a set X of transcription rates of these 
genes in T time points, is it possible to reconstruct the 
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Figure 7 The predicated mean expression profiles. Expression profile and mean reconstruction of target genes. Switch-on time series: (a) 
GAL80, (b) ASH1, (c) CBF1, (d) GAL4, (e)-(h) The same genes in switch-off time series. The red circle indicates the observed value at each time- 
point. 



time-varying activity level of m regulators, R, at all time 
points and the corresponding model parameters? This is 
an infinite dimensional problem that we tackle by pla- 
cing a stochastic process prior over the activities of 
regulators. 

Our approach relies on a continuous time, differential 
equation description of transcriptional dynamics where 
TFs are treated as latent on/off variables and are 



modelled using a switching stochastic process. The fra- 
mework of the proposed method is shown in the Figure 
9. 

Kinematic model of regulation 

Compared with the gene expression level, the gene tran- 
scription rate can capture more dynamic characteristics 
of transcription regulation. We here employ the 
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Observed | Hidden 

Figure 8 A qualitative molecular model of transcriptional regulation. mRNA encoding a transcription factor (TF) is translated to protein. The 
protein is activated and induces the transcription of a target gene at a certain rate (G). The final accumulation of G mRNA levels is determined 
by this transcription rate and by the rate of G's mRNA degradation. 

V ) 



transcription rate to model the regulation function. We 
first assume: 

♦ The derived transcription rates are average rates 
over a cell population. 

♦ The speed of a TF's binding to or dissociation 
from its target sites is assumed to be much more 
rapid than the transcription process, thus rapid-equi- 
librium approximation can be used. 

Based on the above assumptions, the transcription rate 
of a gene is proportional to the amount of the gene 
bound by its regulators in all genes of the measured cell 
population. We first consider the case that a gene is 
regulated by a single activator. The corresponding regu- 
lation function can be properly described by Michaelis- 
Menten equation: 



dx r(t) 
dt d + r(t) 



+ c - 



cox, 



(1) 



here x represents the mRNA concentration for a parti- 
cular gene, r(t) the concentration of active TF, P and d 
are kinetic constants, c a baseline expression rate and co 
the mRNA decay rate. 

To incorporate the sequence feature and the TF bind- 
ing preference into the model, we set the binding 



affinity cp = k/d, and (1) can be reformulated as 
dx kcpr{t) 



dt 1 + k<pr(t) 



+ c — cox, 



(2) 



here h is a scaling parameter. 

We now take the regulation involving two regulators 
into account. Denote by r^t) and r 2 (t) the concentration 
of two regulators, cfii and c6 2 the binding affinity of two 
regulators from their own target sites, the regulation 
function can be written as below: 



dx _ h(pir(t) 1 +k 2 (p2r 2 (t) + k 3 (pi(p 2 ri(t)r 2 (t) 



dt 



(1 +^ri(t)) (l + <p 2 r 2 {t)) 



cox, (3) 



Considering the general case, a gene is regulated by n 
regulators. There are 2 n different binding states in total. 
The n-dimension binary vector is employed to indicate 
a certain binding state, e.g., a 4-dimension vector (0 10 
1) indicates that the second and the fourth regulators 
are bound to their own target sites while the first and 
the third are not bound. The regulation function can be 
written as: 



dXj_ J2se S] k s n ;= l;i=l,n ^(0 

dt =pj nz.i(i+w(t)) +Cj 



(4) 
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Figure 9 The framework of proposed method. The expression data and sequence features of promoters are incorporated into a general 
learning model. The outputs of the model are kinetic parameters and the activity levels of transcription parameters. 



where Sj denotes the set of all 2 n possible state vec- 
tors, and Si is the i t h element of the state vector s. 

Modelling for binding affinity 

Measuring affinities of molecular interactions in high- 
throughput format remains problematic, especially for 
transient and low-affinity interactions. We here try to 
describe the landscape of binding affinity in the perspec- 
tive of binding energy between the various DNA-binding 
molecules and their target genes. Binding affinity land- 
scapes describe how each molecule translates an input 
DNA sequence into a binding potential that is specific 
to that molecule. The presented framework models sev- 
eral important aspects of the binding process. 

By allowing molecules to bind anywhere along the 
input sequence, the entire range of affinities is consid- 
ered, thereby allowing contributions from both strong 
and weak binding sites [26,27]. 

♦ Conventional cooperative binding interactions can 
be explicitly modelled by assigning higher statistical 



weights to configurations in which two molecules 
are bound in close proximity. 

♦ The cooperativity that arises between factors when 
both nucleosomes and transcription factors are inte- 
grated is captured automatically [28]. 

We first consider the simplest case that there is only 
one target site for TF i in the promoter of gene /: 



TFi + S 



V 



[TFi ■ Sij] 



The site-specific binding affinity is given by 

En 

<p = Qe 



1± 
kT 



(5) 



where Q is a constant, the binding free energy 
between TFi and the promoter of gene h and T are 
the Boltzmann constant and temperature, respectively. 

The above case can be expanded to the general case 
that binding may happen in anywhere along the input 



Wang and Li BMC Systems Biology 2012, 6(Suppl 1):S3 
http://www.biomedcentral.eom/1752-0509/6/S1/S3 



Page 1 0 of 1 3 



sequence and the accessibility of target sites varies due 
to the occupancy of nucleosomes. The general binding 
affinity is modelled as 



<Pij 



(6) 



n=l 



where p ij is the probability of transcription factor i 
binding to the nth binding site in the promoter of gene /. 
Note that the derivation of p (n \j involves the information 
of the possible occupancy of nucleosomes. The nucleo- 
somes positions can be predicted based on the nucleo- 
some-DNA interaction model proposed by Kaplan et al 
[29]. Figure 10(b) shows the occupancy of nucleosomes 
for the genomic sequence shown in the Figure 10(a). 

Since the positional weight matrices (PWM) are often 
used to represent the sequence motif [30,31] > we esti- 
mate the binding energy in perspective of PWM. As the 
background information has been taken into the PWM, 
the binding free energy can be approximately calculated 
as below: 



where ] n \ 



I if n = s(l) 
0 otherwise 



here K (q) is the scaling factor, M* L indicates the maxi- 
mum background frequency in the motif, s(l) is the 
nucleotide in position /. 

Regulatory network modelling using dynamic Bayesian 
inference 

In many biological processes, the transcription factor 
transit from inactive to active state quickly as a conse- 
quence of fast post-translational modifications, (the time 
scale is micro second), so it is reasonable that we model 
the TF activity as a binary variable r(t) e {0,1} [24,32]. 

For the regulation involving a single regulator, the TF 
activity can be seen as a two states Markov Jump Pro- 
cess. Based on Ref [24,33], the probability of the system 
being in a particular state at a given time is given by the 
following Master equation: 



dpijt) 
dt 



n + p 0 {t) - n-pi{t) 



(7) 



i=i t 



£ A (Ml* 

{A,C,G,T} 



■Mi/) 



dpojt) 
dt 



n+pi(i) - n-p 0 (t) 



(8) 




s 

"*WCCATCTTACAGTCCACC ATCC ACWTCTC AC 1 



Ayg, Occupancy 



II 





P. Start 



A|0,U OLE -Oil -143 -143 0.72 0.S6 
C | -0,16 -0.16 461 -L43-I.43 416 -0.61 
G | -0.61 -0.61 iU 1.19 -1.43 -Oil 461 
T 1 0.38 -Oil 461 -1.43 1.19 461-0.61 



KaplanOS: Coverage (p>0.2) 



a n 

Figure 10 Employing sequence features and the occupancy of nuclesomes to estimate the binding affinity. The positional weight matrices 
are used to represent the sequence motif. The binding may occur anywhere along the input sequence, the entire range of affinities is considered. 
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here p x (t) = p(r(t) = 1), p 0 (t) = p(r(t) = 0) and n ± indi- 
cates the transition rate. 

The observed expression data is often assumed to be 
normally distributed [24], We now define a noise model 
that relates the predicted mRNA concentration to the 
observed expression data, as shown in Figure 11. 

Setting yj(t) as the observations of mRNA species / at 
time t y Xj(t) the predicted expression and Gj the varia- 
tion, the noise model can be described as 

Yj {t) |r(t) ~ N{ Xj {t), a 2 ,) 

Based on Refs [24,33], we define the TF's switching 
stochastic process as the prior distribution, then we 
combine the prior distribution and the noise model 
(likelihood) into Bayes' theorem to obtain the posterior 
over the process: 

Pir\y, n) = \p{y\r, Q)p{r) 

where y denotes collectively the observations, £2 are 
the parameters involved in the regulation function and S 
a normalization constant. 

Variational inference and model optimization 

We will use a variational formulation of the inference 
problem [34]. Variational inference is a powerful infer- 
ence method and it has been well applied for optimiza- 
tion by Opper and Sanguinetti [24,33]. Our model 
optimization is based on Ref [22]. Variational inference 
is used as an approximation technique: given an intract- 
able probability distribution p, the variational approach 
finds an optimal approximation q within a certain family 
of distributions. This is usually done by minimizing the 



Kullback-Leibler (KL) divergence between the two distri- 
butions 

KL[q \\p] = E q [log q -] = f log ^{r)dr (9) 

By selecting a suitable family of approximating distri- 
butions, the inference problem is then turned into an 
optimization problem. It can be shown that the KL 
divergence is a convex functional of q and is equal to 
zero iff q = p [24,35]. In this case, we will choose the 
approximating process q to be again a Markov Jump 
Process, so that the required KL is given by 

KLlq\\p p J=KLlq\\p pri J + logS-E q [logp(y\r l Q)] (10) 

here S is a normalization constant, E q [\og p{y\r> D)] 
the expectation of the likelihood of the observations 
under the approximating process. 

According to Ref [24], minimization of the KL func- 
tional (11) can be represented as the saddle point pro- 
blem 

n 2 

II - °~ 9 

/ = maxmin{KL[4 \\pprior] + 2^ hfe *(*/)) y T j 11(11) 

7=1 

here r is auxiliary variables. It can be found that this 
functional is concave in r and convex in q. Hence we 
can exchange min and max. Performing the max first 
yields the result. This also shows that there is only a 
unique saddle point solution. 

The optimization procedure is based on a forward- 
backward procedure, leading to ordinary differential 
equations which can iteratively be solved. Taking the 
regulation involving two regulators for example, the free 
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Figure 12 The framework of the inference. 



energy is a functional of both the approximating pro- 
cesses q\ q 2 and their transition rates n x , n 2 . However, 
these are not independent, but are related by the Master 
equation. To incorporate this constraint, we add 
Lagrange multipliers as 



u, 2 .), h '(l)-n 2+ ]k 2 {t)dt 



(12) 



where %i and g 2 are the rates of jumps from the 0 to 
the 1 state for process q 1 and q 2 , respectively. 

The Lagrange multiplier functions obey the final con- 
dition X(T) = 0. Estimation of the parameters A and b 
can be done directly by maximizing the approximate 
marginal likelihood E q [logp(y\r,{2)]. The framework of 
the inference is shown in the Figure 12. 

Additional material 



Additional file 1: Table SI. The time series gene expression data for 
circadian patterns in rat liver. 
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